Abstract



Recognizing shallow linguistic patterns, such as basic syntactic relationships between words, 
is a common task in applied natural language and text processing. The common practice for 
approaching this task is by tedious manual definition of possible pattern structures, often 
in the form of regular expressions or finite automata. This paper presents a novel memory- 
based learning method that recognizes shallow patterns in new text based on a bracketed 
training corpus. The examples are stored as-is, in efficient data structures. Generalization is 
performed on-line at recognition time by comparing subsequences of the new text to positive 
and negative evidence in the corpus. This way, no information in the training is lost, as can 
happen in other learning systems that construct a single generalized model at the time of 
training. The paper presents experimental results for recognizing noun phrase, subject-verb
and verb-object patterns in English.



1 Introduction 



Identifying local patterns of syntactic sequences and relationships is a fundamental task in 
natural language processing (NLP). Such patterns may correspond to syntactic phrases, like 
noun phrases, or to pairs of words that participate in a syntactic relationship, like the heads 
of a verb-object relation. Such patterns have been found useful in various application areas, 
including information extraction, text summarization, and bilingual alignment. Syntactic 
patterns are useful also for many basic computational linguistic tasks, such as statistical 
word similarity and various disambiguation problems. 

One approach for detecting syntactic patterns is to obtain a full parse of a sentence and 
then extract the required patterns. However, obtaining a complete parse tree for a sentence 
is difficult in many cases, and may not be necessary at all for identifying most instances of 
local syntactic patterns. 

An alternative approach is to avoid the complexity of full parsing and instead analyse a
sentence at the level of phrases and the relations between them; this is the task of shallow
parsing (Abney, 1991; Greffenstette, 1993). In contrast to full parsing, shallow parsing can
be achieved using local information. Previous works (cf. section 2) have shown that it is
possible to identify most instances of shallow syntactic patterns by rules that examine only
the pattern itself and its nearby context. Often, the rules are applied to sentences that were
tagged by part-of-speech (POS) and are phrased by some form of regular expressions or finite
state automata.

Manual writing of local syntactic rules has become a common practice for many applica- 
tions. However, writing rules is often tedious and time consuming. Furthermore, extending 
the rules to different languages or sub-language domains can require substantial resources 
and expertise that are often not available. As in many areas of NLP, a learning approach is 
appealing. 

Abney (1991) introduced the notion of chunks, denoting sequences of words with a certain
syntactic function (more in section 2.1). The task of dividing a sentence into meaningful
sequences of words is thus referred to as chunking. The most studied chunking task is that
of identifying noun phrases (NPs, e.g. Church (1988), Ramshaw and Marcus (1995), Cardie
and Pierce (1998), Veenstra (1998)), naturally due to their central role in a sentence.

This paper presents a novel general learning approach for recognizing local sequential 
patterns, that falls within the memory-based learning paradigm. The method learns from a 
POS tagged training corpus in which all instances of the target pattern are marked (brack- 
eted). Subsequences of the training examples are stored as-is in a trie, thereby facilitating a 
linear-time search for subsequences in the corpus. While the presented algorithm is oriented 
at recognizing sequences, it is useful also for grammatical relations such as subject-verb, 
and verb-object. We present these applications, in which the relations are represented as 
sequences encompassing the relevant words. The algorithm is therefore suitable for learning 
to perform tasks involved in shallow parsing. 
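
As an illustration only, the idea of storing training subsequences in a trie can be sketched in Python as follows. The class and function names (`TrieNode`, `insert_all_subsequences`, `lookup`) are hypothetical and are not taken from the original system; the sketch stores plain tag sequences without brackets, and looking up a subsequence takes time linear in its length.

```python
class TrieNode:
    """One trie node; `count` is how many inserted subsequences end here."""
    def __init__(self):
        self.children = {}
        self.count = 0


def insert_all_subsequences(root, tags):
    """Insert every contiguous subsequence of `tags` into the trie."""
    for start in range(len(tags)):
        node = root
        for tag in tags[start:]:
            node = node.children.setdefault(tag, TrieNode())
            node.count += 1  # the subsequence tags[start:...] ends at this node


def lookup(root, seq):
    """Return how often `seq` occurred as a contiguous subsequence."""
    node = root
    for tag in seq:
        node = node.children.get(tag)
        if node is None:
            return 0
    return node.count


# Toy memory built from two tag sequences (assumed data, for illustration).
root = TrieNode()
insert_all_subsequences(root, "DT ADJ ADJ NN NN".split())
insert_all_subsequences(root, "DT ADJ NN NNP".split())
```

A query such as `lookup(root, ["ADJ", "NN"])` walks one trie edge per tag, independently of the corpus size.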

The memory-based nature of the presented algorithm stems from its induction strategy:
a sequence is recognized as an instance of the target pattern by examining the raw training
corpus, searching for relevant positive and negative evidence. No model is created for
storing the training corpus, and the raw data are not converted to any other representation.
Modelling lies in the choice of the kinds of data retrieved from the stored corpus, that is, in
the substrings being searched in the memory. We believe this choice to be highly relevant
for linguistic data.



The POS tag set used throughout the paper is the Penn TreeBank set (Marcus, Santorini,
and Marcinkiewicz, 1993): DT = determiner, ADJ = adjective, RB = adverb, VB = verb,
PP = preposition, NN = singular noun, and NNP = plural noun. As an illustrative example,
suppose we want to decide whether the candidate sequence

DT ADJ ADJ NN NNP 

is an NP using information from the training corpus. Finding the entire sequence as-is 
several times in the corpus would yield an exact match. Due to data sparseness, however, 
that cannot always be expected. 

A somewhat weaker match may be obtained if we consider tiles, sub-parts of the candi- 
date sequence. For example, suppose the corpus contains noun phrases with the following 
structures: 

(1) DT ADJ ADJ NN NN 

(2) DT ADJ NN NNP 

The first structure provides positive evidence that the sequence 'DT ADJ ADJ NN' is a pos- 
sible NP prefix while the second structure provides evidence for 'ADJ NN NNP' being an NP 
suffix. Together, these two training instances provide positive evidence that covers the entire 
candidate. Considering evidence for sub-parts of the pattern enables us to generalize over 
the exact structures present in the corpus. Similarly, the algorithm considers the negative 
evidence for such sub-parts by noting where they occur in the corpus without being a cor- 
responding part of a target-pattern instance. Surrounding context and evidence overlap are 
considered as well. 
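
The prefix and suffix example above can be illustrated with a small Python sketch. It is a deliberate simplification: it only checks whether some prefix and some suffix of the candidate, each attested in a known NP, jointly cover the candidate without a gap, and it ignores negative evidence, context and scoring. The function names are hypothetical.

```python
def is_np_prefix(seq, known_nps):
    """True if `seq` starts at least one known noun phrase."""
    return any(np[:len(seq)] == seq for np in known_nps)


def is_np_suffix(seq, known_nps):
    """True if `seq` ends at least one known noun phrase."""
    return any(np[-len(seq):] == seq for np in known_nps)


def covered_by_prefix_and_suffix(candidate, known_nps):
    """Check for an attested prefix candidate[:p] and an attested suffix
    candidate[s:] with s <= p, so the two pieces leave no gap."""
    n = len(candidate)
    return any(
        is_np_prefix(candidate[:p], known_nps)
        and is_np_suffix(candidate[s:], known_nps)
        for p in range(1, n + 1)
        for s in range(0, min(p, n - 1) + 1)
    )


known_nps = ["DT ADJ ADJ NN NN".split(), "DT ADJ NN NNP".split()]
candidate = "DT ADJ ADJ NN NNP".split()
```

Here the prefix 'DT ADJ ADJ NN' comes from structure (1) and the suffix 'ADJ NN NNP' from structure (2), so the candidate is covered although it never occurs as a whole.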

Other implementations of the memory-based paradigm for NLP tasks include Daelemans
et al. (1996), for POS tagging; Cardie (1993), for syntactic and semantic tagging; and Stanfill
and Waltz (1986), for word pronunciation. In all these works, examples are represented as
sets of features and the induction is carried out by finding the most similar cases. The
memory-based works of Bod (1992, DOP) for parsing, and Yvon (1996) for pronunciation
use the raw form of the data, rather than encode it as features. The method presented here
is similar in that it makes use of raw sequential data, and generalizes by reconstructing test
examples from different pieces of the training data.

Previous related work is described in section 2. Section 3 describes the inference algorithm
formally; the section after it presents experimental results for three target syntactic patterns
in English, along with a comparison with related results.



2 Background 

We present here a brief overview of the state-of-the-art in shallow parsing, covering both
hand-crafted and learnable parsers. In addition, since shallow parsing can be viewed as a
sub-task of full parsing, we also discuss relevant methods for learning full parsing. Section 2.1
presents some of the current hand-crafted shallow parsers, while sections 2.2 and 2.3 present
methods for learning full and shallow parsing respectively.



2.1 Shallow Parsers with Hand-Written Rules

Much work to date on shallow parsing has been based on hand-crafted sets of rules, generally 
using a probabilistic model to choose between parse alternatives. 

Abney (1991, 1996) has pioneered work on shallow parsing. His work was motivated
by psycholinguistic evidence (Gee and Grosjean, 1983), indicating that language processing
involves dividing the sentence into performance structures — corresponding to pauses and 
intonation changes during speech. Abney introduced the concept of chunks, defined as 
consisting of 'a single content word surrounded by a constellation of function words, matching 
a fixed template'. His chunk parser operates in two phases: a chunker which offers potential 
chunks, and an attacher which resolves attachment ambiguities and selects the final chunks. 
The chunker makes use of POS data, whereas the attacher requires lexical information. That 
way, lexical information is used only for the tasks for which it is more important, and simpler 
POS information is used for the more basic task of chunking. Both parts of Abney's chunk 
parser are implemented as non-deterministic LR parsers. The distinction between chunking 
and attachment is common to all the systems presented in this section. 

Aït-Mokhtar and Chanod (1997) presented a sequence of finite-state transducers for
extracting subjects and objects from POS-tagged French texts. Extraction is incremental: each
processing phase is carried out by a dedicated transducer, which prepares the input for the
next phase. The system relies only on POS information, that is, it does not require lexical
information. For various corpora, the recall and precision for subjects were above 90.5% and
86.5% respectively, whereas for objects these figures exceeded 84.4% and 79.9%.

Much shallow parsing effort has been motivated by information-extraction (IE) tasks
such as MUC. For example, the FASTUS (Appelt et al., 1993) system uses cascaded,
non-deterministic finite-state automata for extracting noun groups, verb groups, and particles.
As an IE system, FASTUS is built for cases where only part of the text is relevant, and the
target patterns can be represented in a simple rigid fashion.

Schiller (1996) presented a finite-state multilingual system for NP detection. This system 
is also intended for simple phrases, without coordinations or relative clauses. 

The SPARKLE (Shallow PARsing for acquisition of Knowledge for Language Engineering,
http://www.ilc.pi.cnr.it/sparkle/sparkle.html) project aims mainly at developing
'robust and portable tools leading to commercial applications devoted to the management
of multilingual information in electronic form'. Phrasal-level syntactic analysis is essential for
the objective of the project; the resulting parses will serve for acquiring lexical information.
Part of the SPARKLE project is thus devoted to developing shallow parsers.

• The shallow parser for English is developed at Cambridge and Sussex universities.
Parsing is carried out by a generalized LR parser, which uses a unification-based phrasal
grammar of POS tags. The parser performs disambiguation based on a probabilistic
model, that is, the rules are fixed, but their probabilities are learned. The
reported recall and precision are 82.5%/83% for phrases (without lexical information)
and 88.1%/88.2% for grammatical relations (including lexical information).

• In the system for German, developed at the University of Stuttgart, the grammar rules
are written in a bottom-up fashion. The parser is a standard chart parser, extended to
include head-markings for lexicalization. The parser is also enhanced to carry out EM
estimation of hand-written CFG rule probabilities. That way, both parsing and lexical
acquisition are integrated in a single process. Final evaluation data were not available.

• The system developed for French, at Rank-Xerox Research Centre (RXRC), is a collection
of finite-state transducers. The input is a POS-tagged text, and each transducer
inserts markup symbols for the corresponding pattern. Parsing is carried out by invoking
the transducers bottom-up, starting from the easier tasks, using the transducer
output as an input to the next one. Once the text is marked, another transducer
identifies the head words. That information is used in the last phase, by a transducer
which identifies syntactic functions.



2.2 Learning: Full Parses 

Full parsing learning methods are applicable in general to shallow parsing, where they can 
be used for extracting partial parses. 

Some methods aim at estimating probabilities of hand-written grammar rules, based on
annotated corpora. These include the works of Pereira and Schabes (1992) for CFG, Chitrao
and Grishman (1990) for context-sensitive grammar, Eisner (1996) for dependency parsing,
Magerman and Marcus (1991) for bottom-up chart parsing, and Briscoe and Carroll (1993)
for attribute-value grammars. In the rest of the section we present works which do not use a
given set of grammar rules, but induce the parsing from statistical features extracted from
the training data.

Brill (1993) used transformation-based learning for learning to produce an unlabeled
parse tree. Given a POS-tagged sentence, the learned transformations manipulated insertion
and deletion of left and right parentheses depending on the current POS tag and, possibly,
the POS tag of the previous or next word.

SPATTER (Magerman, 1995) is a learning algorithm for parsing which makes use only
of hierarchical structure information. It learns a decision-tree model for tagging and parsing,
where parsing decisions build a hierarchical structure in a bottom-up manner. The questions
at the nodes of the tree take into account neighboring as well as child nodes; the result is a
complete parse tree. It scores 84% recall and 84.3% precision on Penn TreeBank WSJ data.

The parser of Collins (1997) is based on statistics of lexical dependencies, subcategorization
and wh-movement. His parser scored 88.1% recall and 87.5% precision on Penn
TreeBank WSJ data, currently the best result on that dataset.

Recently, Sekine (1998) built a system based on learning parse rules for five non-terminals:
sentence, subordinate sentence, infinitive sentence, NP, and base NP; the system also uses
lexical dependency information. In addition, his system is capable of performing 'fitted
parsing', that is, combining partial trees or chunks built by these rules to produce a complete
parse. With combined chunks restricted to S nodes or punctuation marks, less than 1% of
the test sentences required fitting. Even after 40,000 training sentences, each new sentence
produced, on the average, a new rule. The parsing rules were mapped to an automaton
in order to handle their large number; best-first search and Viterbi search were used for
optimization.

The maximum-entropy parser by Ratnaparkhi (1997) operates in three passes: tagging,
chunking, and building hierarchical structure. A maximum-entropy model is employed at
each phase. The features which the models test are 'contextual predicates': word, POS or
chunk-tag of neighboring (up to a distance of 2) words, or features which represent hierarchical
dependency. Each pass prepares information for the next one. The performance of
the system on the data on which SPATTER was tested is 86.3% recall and 87.5% precision.

Data Oriented Parsing (DOP), presented by Bod (1992), is another example of an algorithm
based purely on hierarchical structure information. It relies on the idea of producing a
parse-tree by combining sub-trees from a memory of previous parses. For example, suppose 
the training data contained the parsed sentences: 

(S (NP (DT The) (JJ pretty) (NN bird)) (VP (VB sings))) 
(S (NP (DT The) (NN airplane)) (VP (VB flies))) 

The system memory would contain all the sub-trees of these parse trees; some of them are: 

(S (NP) (VP)) 

(VP (VB)) 

(NP (DT) (NN)) 

(NP (DT) (JJ) (NN)) 

(DT the) 

(NN bird) 

(VB flies) 

Given the new sentence: 'The bird flies', the parse tree 
(S (NP (DT the) (NN bird)) (VP (VB flies))) 

can now be built by combining the sub-trees from the memory. 

In the general case, each sub-tree is scored according to its frequency relative to other
sub-trees with the same root. Then, the various parse alternatives are scored according
to the scores of their building blocks. Finding the best parse is NP-hard (Sima'an, 1996);
therefore the DOP parsing algorithm uses a Monte-Carlo approximation and the resulting
parse is an estimation of the best one.

Note that DOP does not rely on a given set of parsing rules (nor does it build such a 
set); instead, it reads the parsing trees as raw data and uses them as building blocks. The 
algorithm presented in this paper is similar to DOP in that it tries to build evidence for 
bracketing an input sequence based on matching subsequences of the target pattern stored 
in the memory. 

2.3 Learning: Shallow Parses 

A number of systems have been developed for learning to perform shallow parsing. Most 
of these systems learn to identify chunks such as noun phrases, while some others learn to 
identify relationships between chunks (or representative words thereof).

Church (1988) uses a simple model for finding base (non-recursive) NPs in a sequence
of POS tags. Viewing the task as a bracketing problem, he calculated the probability of
inserting open/close brackets between POS tags. A sequence of POS tags, representing a
sentence, may be chunked into base NPs in various ways. Each chunking alternative is scored
using the probabilities, and the best alternative is then chosen. The algorithm was tested on
the Brown corpus; while exact results are not reported, they are described as encouraging.
In particular, that motivates using POS-tag information alone for NP detection. That is
important because the few dozen POS tags require far fewer resources than any collection
of lexical information.

Ramshaw and Marcus (1995, hereafter RM95) viewed the problem as classification of
words. Each word was assigned a chunk-tag: 'I' or 'O' for words inside or outside a chunk,
and 'B' for words which stand at the beginning of a base NP that immediately follows
another base NP. Thus, in contrast to (Church, 1988), where the sentence was chunked based
on a global criterion (the chunking with the higher score), here the decision is local. RM95
used transformation-based learning (Brill, 1993), with rule-templates referring to neighboring
words, POS tags, and chunk tags (up to a distance of 3 for words or POS tags, and 2 for
chunk tags). Their work is the first one to present large-scale testing of base NP learning.
Training on 950K words from the Penn TreeBank WSJ data tagged by Brill's POS tagger
(Brill, 1992), with a test set of 50K words, they achieved a recall of 93.5% and precision of
93.1%. RM95 demonstrate that the contribution of the lexical information is about 1%; the
main information sources were thus the neighboring POS and chunk tags. They also report
a chunk-tagging accuracy of 97.8%.
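
The chunk-tag encoding described above can be illustrated by a short Python sketch that converts a bracketed tag sequence into per-word 'I', 'O' and 'B' tags. The function name `chunk_tags` and the use of '[' and ']' tokens as input are assumptions made for illustration; this is not the RM95 implementation.

```python
def chunk_tags(tokens):
    """Map a bracketed sequence to (tag, chunk_tag) pairs.

    'I' = inside a base NP, 'O' = outside, 'B' = first word of a base NP
    that immediately follows another base NP.
    """
    out = []
    inside = False       # currently between '[' and ']'
    first = False        # next word is the first word of its chunk
    prev_closed = False  # a chunk ended right before the current position
    for tok in tokens:
        if tok == "[":
            inside, first = True, True
        elif tok == "]":
            inside, prev_closed = False, True
        else:
            if inside:
                label = "B" if (first and prev_closed) else "I"
                first = False
            else:
                label = "O"
            prev_closed = False
            out.append((tok, label))
    return out


example = chunk_tags("[ DT NN ] [ NN ] VB [ NN ]".split())
```

Only the second chunk receives a 'B' tag, because it is the only one that directly follows another base NP.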

Veenstra (1998) recently presented an application of the IGTree (Daelemans, van den
Bosch, and Weijters, 1996) memory-based learning algorithm for NP chunking. Using the
data of RM95, POS tagged by the Memory-Based Tagger (Daelemans et al., 1996), he also
assigned a chunk tag ('I', 'O', 'B') to each word. The features included lexical and POS
information at a distance of up to two words. Some variants have generally yielded higher 
recall and lower precision than RM95, see table ^ for details. While both works used infor- 
mation about the near context as features, a further correction phase was required in order 
to make sure that the chunk tagging yields a proper chunking (e.g. a 'B' tag cannot follow an
'O' tag). That phase essentially makes use of information which may result from decisions
made for words that are quite far away.

Another memory-based approach was presented by Cardie and Pierce (1998). They
created a set of grammar rules from the NP training inventory, and pruned it using a separate 
corpus. In the inference phase, the longest matching rule was applied. The system accepts 
POS-tag strings, that is, it does not handle lexical information. Direct comparison with 
the results of RM95 was not presented, but cross-validation results on NPs created in a 
similar fashion yielded 91.1% recall and 90.7% precision - similar to what RM95 obtained 
ignoring lexical data. Cardie and Pierce have used a pruning methodology which discards 
the ten worst rules until precision drops, as well as local repair heuristics which improved 
the precision by 1% without harming the recall. Their method works better, as they note, 
on simpler NPs. 

Skut and Brants (1998) also worked at the word level, but with features which include
hierarchical structure information as well. They used the maximum-entropy method for
learning to assign words a triple structural tag: a POS tag, a syntactic category, and a
tag which represents the relation between dominating nodes of the current and the preceding
words. The triple nature of the features makes it possible to use only some of the three tags,
thereby obtaining a more general and, possibly, meaningful feature. Weights of the structural
features are evaluated using the maximum-entropy method, then incorporated into a trigram
model. The reported results for NP and PP chunking of a German text are 88.9% recall and
87.6% precision.

3 The Algorithm 

The input to the Memory-Based Sequence Learning (MBSL) algorithm is a sentence represented
as a sequence of POS tags, and its output is a bracketed sentence, indicating which
subsequences of the sentence are to be considered instances of the target pattern (target
instances). The training corpus consists of pre-bracketed sentences; it is used for creating
the memory. MBSL (see figure 1) determines the bracketing by first considering each
subsequence of the sentence as a candidate to be a target instance. For each candidate c, it
computes a likelihood score $f_C(c)$ by searching the memory, keeping c if its score is above a
threshold $\theta_C$. The algorithm then finds a consistent bracketing for the input sentence, giving
preference to subsequences whose score was high. In the remainder of this section we will
describe the components of the algorithm in more detail.
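
The top-level loop can be sketched in Python as follows. This is a hedged illustration, not the authors' code: the scoring function is passed in as a parameter, and the final selection step is an assumed greedy choice of the highest-scoring non-overlapping candidates, since the text here only states that high-scoring subsequences are preferred. The names `situate` and `mbsl_bracket` are hypothetical.

```python
def situate(tags, i, j, cn):
    """Situated candidate for tags[i:j] with up to `cn` context tags per side."""
    left = tags[max(0, i - cn):i]
    right = tags[j:j + cn]
    return left + ["[["] + tags[i:j] + ["]]"] + right


def mbsl_bracket(tags, score, theta_c, cn=2):
    """Return (start, end) spans chosen as target instances."""
    scored = []
    for i in range(len(tags)):
        for j in range(i + 1, len(tags) + 1):
            s = score(situate(tags, i, j, cn))
            if s > theta_c:
                scored.append((s, i, j))
    # Assumed selection step: greedily keep the best non-overlapping spans.
    chosen = []
    for s, i, j in sorted(scored, key=lambda x: (-x[0], x[1], x[2])):
        if all(j <= a or i >= b for a, b in chosen):
            chosen.append((i, j))
    return sorted(chosen)
```

Any function mapping a situated candidate to a number can be supplied as `score`, which keeps the enumeration of candidates separate from the memory search.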

insert figure 1 here 
3.1 Scoring candidates 

We first describe the mechanism for scoring candidates. The input is a candidate subse- 
quence, along with its context, i.e. surrounding tags in the input sentence. We derive a score 
for the candidate by searching substrings of the candidate and its context in the training 
corpus. 

Consider first the case when the candidate subsequence as a whole occurs in the training 
corpus as an actual target instance. For example, consider the candidate ADJ NN NN, where 
the training corpus contains the bracketed sentence 

[ NN ] VB [ ADJ NN NN ] RB PP [ NN ] . 

Here the candidate sequence is a target instance in the training sentence. This fact constitutes
positive evidence that the candidate is indeed a target instance. The evidence would
be even stronger if the candidate's context in the input sentence was the same as that in 
the training sentence. However, if the candidate (with or without context) also occurs in 
the training not as a target instance, we would also have negative evidence indicating that 
the candidate might not be a target instance. For example, the candidate NN NN RB occurs 
negatively in the training sentence above. 

We may generalize this notion by considering positive and negative evidence to include
substrings of the candidate and its context. For example, if the sequence ADJ NN often
occurs at the beginning of a noun phrase in the training, we have some evidence that a candidate
beginning with that sequence is a noun phrase. For example, although the candidate
ADJ NN NN NN does not occur as a whole in the training sentence above, the sentence gives
positive evidence for the candidate, in the form of the prefix subsequence ADJ NN NN and
the suffix subsequence NN NN.

The basic idea, then, is to find the set of subsequences of the candidate and its context 
which provide positive evidence. This set is then used to compute the candidate score. 

3.1.1 Candidates and tiles 

The MBSL scoring algorithm works by considering situated candidates. A situated candidate
is a POS sequence containing a pair of brackets ('[[' ... ']]'), indicating a candidate for a
target instance. The portion of the sentence between the brackets is the candidate (as above),
while the portions to the left and to the right of the candidate are its context.

The idea of the MBSL scoring algorithm is to construct a tiling of subsequences of a 
situated candidate which covers the entire candidate. We consider as tiles all subsequences 
of the situated candidate which contain at least a left or right bracket. Thus we only 
consider tiles within or adjacent to the candidate that also include a candidate boundary. 
This constraint reduces the computational complexity of the algorithm (and the run-time 
in practice), while taking into account the primary evidence for a target instance at the 
instance boundaries. 

Although in principle our algorithm may work with unlimited context, in practice we only
consider a fixed amount of maximal left and right contexts. Let the context limit be denoted
by $\mathit{cn}$; the total number of tiles for a candidate of length $l$ is
$n_{\mathit{tiles}} = 2 \cdot \mathit{cn} \cdot (l + 2) + 2l + \mathit{cn}^2 + 1$.
For a fixed maximal context, $n_{\mathit{tiles}} = O(l)$, whereas for a fixed length the number of tiles grows
as $\mathit{cn}^2$. A large $\mathit{cn}$ will therefore yield an unmanageable number of tiles. That is the reason
for considering a fixed maximal context; we have used values of 2 or 3 in our experiments.
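
The tile count can be checked by brute force. The Python sketch below enumerates tiles as contiguous spans of a situated candidate that contain a bracket, under the assumption (our reading of the definition) that a tile must also contain at least one tag, so a lone bracket is not a tile. With that reading the enumeration agrees with the closed-form count $2 \cdot \mathit{cn} \cdot (l + 2) + 2l + \mathit{cn}^2 + 1$. Function names are hypothetical.

```python
def enumerate_tiles(situated):
    """Return (start, end) spans of all tiles of a situated candidate."""
    spans = []
    n = len(situated)
    for a in range(n):
        for b in range(a + 2, n + 1):  # length >= 2 excludes a lone bracket
            piece = situated[a:b]
            if "[[" in piece or "]]" in piece:
                spans.append((a, b))
    return spans


def n_tiles(l, cn):
    """Closed-form tile count for candidate length l and context limit cn."""
    return 2 * cn * (l + 2) + 2 * l + cn ** 2 + 1


def make_situated(l, cn):
    """Situated candidate with full context on both sides (placeholder tags)."""
    return ["L"] * cn + ["[["] + ["X"] * l + ["]]"] + ["R"] * cn
```

For instance, a candidate of length 3 with a context limit of 2 yields 31 tiles, while raising the context limit to 3 yields 46.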

Each tile is assigned a score based on its occurrences in the training memory. Since 
brackets correspond to boundaries of potential target instances, it is important to consider 
how the bracket positions in the tile correspond to those in the training memory. 

For example, consider again the training sentence 

[ NN ] VB [ ADJ NN NN ] RB PP [ NN ] . 

We may now examine the occurrence in this sentence of several possible tiles: 
VB [ ADJ NN occurs positively in the sentence, and 
NN NN ] RB also occurs positively, while 

NN [ NN RB occurs negatively in the training sentence, since the bracket is in a different 
position. 

The positive evidence for a tile is measured by its positive count, the number of times the tile
occurs in the training memory with corresponding brackets. Similarly, the negative evidence
for a tile is measured by its negative count, the number of times that the POS sequence of
the tile occurs in the training memory with non-corresponding brackets (either brackets in
the training are in a different position than in the tile, or without brackets). The total count
of a tile is its positive count plus its negative count. The score $f_T(t)$ of a tile t is defined as
the ratio of its positive and total counts (other functions could also easily be considered):

$$f_T(t) = \frac{\mathit{pos\_count}(t)}{\mathit{total\_count}(t)}$$
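
A small Python sketch of the tile counts follows. It adopts one possible reading of 'corresponding brackets', stated here as an assumption: a tile occurs positively when it appears as a contiguous token sequence (tags and brackets together) in a bracketed training sentence, and the total count is the number of occurrences of its tag sequence with all brackets ignored. Names are hypothetical, and a linear scan replaces the trie-based memory.

```python
BRACKETS = {"[", "]"}


def count_sublist(haystack, needle):
    """Number of positions where `needle` occurs contiguously in `haystack`."""
    m = len(needle)
    return sum(1 for i in range(len(haystack) - m + 1)
               if haystack[i:i + m] == needle)


def tile_counts(tile, corpus):
    """Return (positive_count, total_count) of a tile over bracketed sentences."""
    tile_tags = [t for t in tile if t not in BRACKETS]
    pos = total = 0
    for sent in corpus:
        pos += count_sublist(sent, tile)
        sent_tags = [t for t in sent if t not in BRACKETS]
        total += count_sublist(sent_tags, tile_tags)
    return pos, total


def tile_score(tile, corpus):
    """Ratio of positive to total count; 0.0 if the tag sequence never occurs."""
    pos, total = tile_counts(tile, corpus)
    return pos / total if total else 0.0


corpus = ["[ NN ] VB [ ADJ NN NN ] RB PP [ NN ] .".split()]
```

On the training sentence above, 'VB [ ADJ NN' and 'NN NN ] RB' each get counts (1, 1), whereas 'NN [ NN RB' gets (0, 1), matching the positive and negative occurrences discussed in the text.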

We say that a tile whose score $f_T(t)$ exceeds a given threshold $\theta_T$, and thus has 'sufficient'
positive evidence, is a matching tile. Each matching tile gives supporting evidence that a
part of the candidate should be considered part of a target instance. In order to combine this
evidence, we try to cover the entire candidate by a set of matching tiles, with no gaps. Such
a covering constitutes evidence that the entire candidate is a target instance. For example,
consider the matching tiles shown for the candidate in figure 2. The set of matching tiles 2,
4, and 5 covers the candidate, as does the set of tiles 1 and 5. Also note that tile 1 constitutes
a cover on its own.

insert figure 2 here 

To make this precise, we first say that a tile $t_1$ connects to a tile $t_2$ if (i) $t_2$ starts after $t_1$,
(ii) there is no gap between the end of $t_1$ and the start of $t_2$ (there may be some overlap),
and (iii) $t_2$ ends after $t_1$ (neither tile includes the other). For example, tiles 2 and 4 in the
figure connect, while tiles 2 and 5 do not, and neither do tiles 1 and 4 (since tile 1 includes
tile 4 as a subsequence).
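
The three conditions translate directly into a predicate over tile spans. In the Python sketch below, a tile is represented (as an assumption for illustration) by a half-open span (start, end) of positions in the situated candidate, with brackets counted as positions. The spans listed are those of the five matching tiles of figure 2 for the candidate NN VB [ ADJ NN NN ] RB.

```python
def connects(t1, t2):
    """True if tile t1 connects to tile t2 (spans are half-open)."""
    s1, e1 = t1
    s2, e2 = t2
    return (s1 < s2        # (i)   t2 starts after t1
            and s2 <= e1   # (ii)  no gap between end of t1 and start of t2
            and e1 < e2)   # (iii) t2 ends after t1


# Positions: 0:NN 1:VB 2:[ 3:ADJ 4:NN 5:NN 6:] 7:RB
TILE = {
    1: (1, 7),  # VB [ ADJ NN NN ]
    2: (1, 4),  # VB [ ADJ
    3: (2, 5),  # [ ADJ NN
    4: (4, 7),  # NN NN ]
    5: (5, 8),  # NN ] RB
}
```

With these spans, `connects(TILE[2], TILE[4])` holds, while `connects(TILE[2], TILE[5])` and `connects(TILE[1], TILE[4])` do not, as stated in the text.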



3.1.2 Cover statistics 

A cover for a situated candidate c is a sequence of matching tiles which collectively cover 
the entire candidate, including the boundary brackets and possibly some context, such that 
each tile connects to the following one. A cover thus provides positive evidence for the entire 
sequence of tags in the candidate. 

The set of all covers for a candidate summarizes all of the evidence for the candidate as 
a target instance. The score of a candidate is therefore a function of statistics of the set of 
all its covers. For example, if a candidate has many different covers, it is more likely to be 
a target instance, since many different pieces of evidence can be brought to bear. 

We have empirically found several statistics of the cover set to be useful. These include, 
for each cover, the number of matches it contains, the total number of context tags it contains, 
and the number of positions which more than one match covers (the amount of overlap). We 
thus compute, for the set of all covers of a candidate c: 

• The total number of different covers, num(c),

• The least number of tiles constituting a cover, minsize(c), 

• The maximum amount of total context in any cover (left plus right context), 
maxcontext(c), and 

• The maximum over all covers of the total number of tile elements that overlap between
connecting tiles, maxoverlap(c).

Each of these items indicates the strength of the evidence which the covers provide for the
candidate.

In order to compute a candidate's statistics efficiently, we construct the cover graph of
the candidate. The cover graph is a directed acyclic graph (DAG) whose nodes represent
tiles of the candidate, such that an arc exists between nodes $v(t_1)$ and $v(t_2)$ whenever tile $t_1$
connects to $t_2$. Two special nodes, START and END, are added to the cover graph. START
is connected to every node whose tile contains an open bracket, and similarly, every node
representing a tile containing a close bracket is connected to END.

It is easy to see that each path from START to END constitutes a cover, and that every
cover gives such a path. Therefore the statistics of all the covers may be efficiently computed
by a depth-first traversal of the cover graph. An algorithm for constructing the cover graph
is given in figure 3.
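
The following Python sketch computes the four cover statistics for the matching tiles of figure 2. It is an illustration under stated assumptions: tiles are half-open spans over positions of the situated candidate, overlap is counted in positions (brackets included), and covers are enumerated explicitly by depth-first search, which is simple but can be exponential, unlike a traversal that aggregates statistics per node. The function names are hypothetical.

```python
def connects(t1, t2):
    (s1, e1), (s2, e2) = t1, t2
    return s1 < s2 <= e1 < e2


def cover_stats(tiles, open_idx, close_idx):
    """Statistics over all covers, or None if the candidate has no cover."""
    def has(t, idx):
        return t[0] <= idx < t[1]

    covers = []

    def dfs(path):
        last = path[-1]
        if has(last, close_idx):       # arc to END: this path is a cover
            covers.append(path)
        for nxt in tiles:
            if connects(last, nxt):
                dfs(path + [nxt])

    for t in tiles:
        if has(t, open_idx):           # arc from START
            dfs([t])
    if not covers:
        return None

    def context(c):                    # context positions left and right
        return (open_idx - c[0][0]) + (c[-1][1] - (close_idx + 1))

    def overlap(c):                    # shared positions of connecting tiles
        return sum(a[1] - b[0] for a, b in zip(c, c[1:]))

    return {
        "num": len(covers),
        "minsize": min(len(c) for c in covers),
        "maxcontext": max(context(c) for c in covers),
        "maxoverlap": max(overlap(c) for c in covers),
    }


# Figure 2 tiles for NN VB [ ADJ NN NN ] RB ('[' at position 2, ']' at 6).
figure2_tiles = [(1, 7), (1, 4), (2, 5), (4, 7), (5, 8)]
stats = cover_stats(figure2_tiles, open_idx=2, close_idx=6)
```

For this example the traversal finds ten covers, the smallest being tile 1 alone.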

insert figure 3 here 

A graph whose structure is similar to that of the cover graph was used by Dedina and
Nusbaum (1991) and Yvon (1996) for representing alternative pronunciations of a word.
In the application of Yvon (1996), the nodes are labelled with phoneme sequences and
arcs connect nodes whose labels overlap. START and END nodes are connected to prefix
and suffix nodes respectively; each possible pronunciation thus corresponds to a path from
START to END. Nevertheless, the scoring mechanism is different as the aim is to find a best
path rather than calculating statistics of a graph. As with the candidate score, though, the
path scoring function of Yvon (1996) also prefers more overlap and short paths.



3.1.3 Computing the candidate score 

The score of the candidate is a linear function of its cover statistics:

$$f_c(c) = \alpha\,\mathrm{num}(c) - \beta\,\mathrm{minsize}(c) + \gamma\,\mathrm{maxcontext}(c) + \delta\,\mathrm{maxoverlap}(c)$$

If candidate $c$ has no covers, $f_c(c) = 0$. Note that minsize is weighted negatively, since a
cover with fewer tiles provides stronger evidence for the candidate. The candidate scoring
algorithm is given in figure 4.

In the current implementation, the weights were chosen so as to give a lexicographic
ordering on the features, preferring first candidates with more covers, then those with covers
containing fewer tiles, then those with larger covered contexts, and finally, when all else is
equal, preferring candidates whose covers have more overlap between connecting tiles. In the
future we plan to investigate a data-driven approach, based on the Winnow algorithm
(Littlestone, 1988), for optimal selection and weighting of the statistical features of the score.
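The lexicographic preference can be illustrated without choosing explicit weights. The sketch below swaps the weighted sum for a tuple comparison, which yields the same ordering whenever the weights are separated widely enough; the `CoverStats` type and `lexicographic_key` function are hypothetical names introduced only for this illustration.

```python
from typing import NamedTuple


class CoverStats(NamedTuple):
    num: int         # number of covers
    minsize: int     # fewest tiles in any cover
    maxcontext: int  # largest covered context
    maxoverlap: int  # largest overlap between connecting tiles


def lexicographic_key(s: CoverStats):
    # minsize is negated because a cover with fewer tiles is preferred.
    return (s.num, -s.minsize, s.maxcontext, s.maxoverlap)


a = CoverStats(num=3, minsize=2, maxcontext=1, maxoverlap=0)
b = CoverStats(num=3, minsize=1, maxcontext=0, maxoverlap=0)
c = CoverStats(num=2, minsize=1, maxcontext=4, maxoverlap=5)

# b beats a (same number of covers, fewer tiles); both beat c (fewer covers).
ranked = sorted([a, b, c], key=lexicographic_key, reverse=True)
print(ranked[0] == b, ranked[-1] == c)  # True True
```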



insert figure 4 here 



3.1.4 Complexity 



We analyse the worst-case complexity of creating the cover graph and computing the score
of a situated candidate in terms of the candidate length $l$. The steps of that process are:
MBSL(sentence, memory, $\theta_t$, $\theta_c$)

1. Consider each subsequence of sentence as a candidate $c$:
2.     Construct a situated candidate $s$ from $c$ by prepending left context
       followed by '[[' and appending ']]' followed by right context;
3.     Evaluate $f_c(c)$ = CandidateScore($s$, memory, $\theta_t$);
4.     If $f_c(c) > \theta_c$, add $c$ to candidate-set;
5. SelectCandidates(candidate-set).

Figure 1: The top-level MBSL bracketing algorithm.

Candidate: NN VB [ ADJ NN NN ] RB

MTile 1: VB [ ADJ NN NN ]
MTile 2: VB [ ADJ
MTile 3: [ ADJ NN
MTile 4: NN NN ]
MTile 5: NN ] RB

Figure 2: A candidate subsequence, and 5 matching tiles found in the training corpus.

CoverGraph(situated-candidate, $\theta_t$)

1. $V \leftarrow \{\mathrm{START}\}$;
2. For each subsequence $t$ (potential tile) of situated-candidate including
   either '[[' or ']]':
3.     SearchTile(memory, $t$) to get positive and total counts;
4.     If $f_t(t) > \theta_t$
5.         add a vertex $v(t)$ to $V$;
6. $E \leftarrow \emptyset$;
7. For each pair of vertices $v(t_1), v(t_2) \in V$ such that $t_1$ connects to $t_2$:
8.     Add the arc $(v(t_1), v(t_2))$ to $E$;
9. Return the graph $(V, E)$.

Figure 3: Constructing the cover graph.
Create vertices: Since there are $O(l)$ potential tiles $t$ in the situated candidate, and
searching for a tile in the memory takes linear time (see section 3.4 below), this step could
take $O(l^2)$.

Create edges: There are at most $O(l^2)$ possible edges in the graph, so this step takes $O(l^2)$.

Computing statistics: The DFS takes $O(|V| + |E|) = O(l^2)$.

Computing the score: This takes constant time.

Hence computing the candidate score for a situated candidate takes $O(l^2)$ in the worst case.
In practice, however, this worst case is rarely reached, since usually only a small fraction of
the potential tiles match.



3.2 Selecting candidates 

In order to select a bracketing for the input sentence, we assume that target instances are
non-overlapping (this is the case for the types of linguistic patterns with which we experimented).
To apply this constraint we use a simple constraint propagation algorithm that finds the best
choice of non-overlapping candidates in an input sentence, given in figure 5. Other methods
for determining the preferred bracketing may also be applied; this remains a topic for future
research.

insert figure 5 here 

We analyse the complexity of candidate selection by first noting that the number of
candidates in the candidate set is at most $O(n^2)$, where $n$ is the length of the input sentence.
The first step in selection is to sort the candidate set by $f_c$, which takes $O(n^2 \log n^2) =
O(n^2 \log n)$. Since there are at most $O(n)$ candidates in the final bracketing, with proper
indexing removing overlapping candidates takes a total of at most $O(n^2)$ time, giving a total
time of $O(n^2 \log n)$.
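The selection step can be sketched as a greedy pass over the candidates in descending score order. This is a minimal illustration, assuming candidates are given as hypothetical `(start, end, score)` triples with an exclusive end position; it checks overlap by scanning the already chosen candidates rather than using the indexing mentioned above, so it does not attain the stated bound.

```python
def select_candidates(candidates):
    """Greedily keep the best-scoring non-overlapping candidates.

    `candidates` is a list of (start, end, score) triples, where the
    candidate covers word positions start .. end-1.
    """
    chosen = []
    # Examine candidates in descending order of score.
    for start, end, score in sorted(candidates, key=lambda c: c[2], reverse=True):
        if score <= 0:
            break  # only candidates with a positive score are considered
        # Keep the candidate only if it overlaps none of those already chosen.
        if all(end <= s or start >= e for s, e, _ in chosen):
            chosen.append((start, end, score))
    return sorted(chosen)


cands = [(0, 3, 5.0), (2, 4, 7.0), (4, 6, 1.0), (5, 6, 0.0)]
print(select_candidates(cands))  # [(2, 4, 7.0), (4, 6, 1.0)]
```

Skipping a candidate that overlaps an already chosen one has the same effect as removing all overlapping candidates from the set when a candidate is accepted.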



3.3 Worst-case complexity of MBSL 

The overall complexity of the bracketing algorithm (figure 1) may now be easily computed.
For a sentence of length $n$, there are $O(n^2)$ candidates, each of length at most $O(n)$. Hence,
since scoring a single candidate of length $l$ takes $O(l^2)$, scoring all of the candidates takes
at most $O(n^4)$. Since the selection step takes $O(n^2 \log n)$, the worst-case complexity of the
MBSL bracketing algorithm is $O(n^4)$. In practice, however, this worst case is rarely reached.



3.4 Implementing the training memory 

The MBSL scoring algorithm above needs to search the training corpus for many subse- 
quences of each input sentence in order to find matching tiles. Implementing this search 
efficiently is therefore of prime importance. We do so by encoding all possible tiles for each 
target instance (positive examples) in the training corpus using a trie data structure. A trie 
is a tree whose arcs are labeled with characters such that each path from the root represents 
a distinct tile in the training corpus. A trie allows searching for a sequence in time linear in the
length of the sequence. Each node in the trie represents the sequence of arc labels along the 
path reaching it from the root. We store the positive and total counts for each tile at its 
associated node in the trie. 

Given a trie as described, determining the positive and total counts of a given potential
tile is done by searching the trie for the tile, and simply returning the counts stored at the
node found. This takes time linear in the length of the potential tile. If tile $t_1$ is a prefix of
$t_2$, then searching for $t_2$ after searching for $t_1$ may be done incrementally, by starting from
$t_1$'s node. Hence searching for a set of tiles all starting at the same point in the text is
performed incrementally in linear (rather than quadratic) time.
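A minimal sketch of such a trie follows. The class and method names (`TrieNode`, `TileTrie`, `insert_positive`, `find`) are hypothetical, and the arcs are assumed to be labelled with POS tags and bracket symbols. The optional `start` argument shows the incremental search: a lookup can resume from the node reached by an earlier lookup of a prefix.

```python
class TrieNode:
    __slots__ = ("children", "pos", "total")

    def __init__(self):
        self.children = {}  # arc label -> child node
        self.pos = 0        # positive count of the tile ending here
        self.total = 0      # total count of the tile ending here


class TileTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert_positive(self, tile):
        """Add one positive occurrence of `tile`, creating nodes as needed."""
        node = self.root
        for sym in tile:
            node = node.children.setdefault(sym, TrieNode())
        node.pos += 1
        return node

    def find(self, tile, start=None):
        """Follow `tile` from `start` (default: the root); None if absent."""
        node = self.root if start is None else start
        for sym in tile:
            node = node.children.get(sym)
            if node is None:
                return None
        return node


trie = TileTrie()
trie.insert_positive(("VB", "[", "NN"))
trie.insert_positive(("VB", "[", "NN"))
trie.insert_positive(("VB", "["))

prefix = trie.find(("VB", "["))               # search for the shorter tile first
longer = trie.find(("NN",), start=prefix)     # then extend it by one symbol
print(prefix.pos, longer.pos)  # 1 2
```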

The memory trie itself is built in two passes. Their worst-case complexity depends on k, 
the number of sentences in the corpus, and n, the maximum sentence length. 

In the first pass, all the tiles of each target instance in the corpus are inserted into the 
trie. For example, considering a maximum context of 1, given the NP instance 

VB [ NN ] IN 

the possible tiles are: 

VB [
VB [ NN
VB [ NN ]
VB [ NN ] IN
[ NN
[ NN ]
[ NN ] IN
NN ]
NN ] IN
] IN

For each possible tile $t$, if $t$ is not in the trie, new nodes required for constructing a path
representing $t$ are added, and the positive count of the final node of the path is set to 1.
Otherwise, if $t$ is already in the trie, the positive count of its associated node is increased
by 1. This search is performed incrementally for tiles sharing a prefix, and so adding all tiles
starting at a given point in a target instance takes time linear in the length of the longest tile (at
most $O(n)$). Thus, since there are at most $O(n)$ tile starting points in a sentence, the first
pass takes time at most $O(kn^2)$.
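The tile enumeration of the first pass can be sketched as below. This is an illustration under stated assumptions: a tile is taken to be any contiguous subsequence of the instance (with its context) that contains at least one bracket and at least one word, which reproduces the ten tiles listed for the example, and a `Counter` stands in for the positive counts that the trie would hold. The function name `instance_tiles` is hypothetical.

```python
from collections import Counter

BRACKETS = ("[", "]")


def instance_tiles(tokens):
    """List the tiles of one bracketed instance, given with its context."""
    tiles = []
    n = len(tokens)
    for i in range(n):                  # tile starting point
        for j in range(i + 1, n + 1):   # tile end point (exclusive)
            sub = tokens[i:j]
            has_bracket = any(t in BRACKETS for t in sub)
            has_word = any(t not in BRACKETS for t in sub)
            if has_bracket and has_word:
                tiles.append(tuple(sub))
    return tiles


# The NP instance from the text, with a maximum context of 1.
instance = ["VB", "[", "NN", "]", "IN"]
positive_counts = Counter(instance_tiles(instance))
print(len(positive_counts))  # 10
```

In the trie, the inner loop would not restart from the root for each end point: all tiles sharing a starting point lie on one path, so they are added in a single downward walk.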

In the second pass, we compute the total counts for each node by considering all the 
subsequences of every sentence in the corpus. For each such subsequence s, we find each 
node in the trie whose associated subsequence is identical to s except possibly for the addition 
of brackets. We then increment the total count for each such matching node. Since each 
tile is a substring of a sentence, there are at most O(n^) ways to construct a possible tile for 
s. Therefore, there are at most O(n^) matching nodes in the trie. Here too we search the 
trie for all subsequences that share a starting point incrementally, and so the search takes 
time O(n^) for each starting point, and 0{kn^) for the entire corpus. Therefore the overall 
worst-case complexity of building the memory is 0{kn^). Although this complexity looks 
large, in practice the worst case is rarely reached, because pattern instances, and
hence tiles, are usually much shorter than the sentences in which they are embedded. The
actual time required for building the memory is short. For example, in the experiments
on NPs the training data contain 8,936 sentences, 54,760 patterns, and 229,598 words. With
a maximum context of 3, it took 16 seconds to build the memory on a 300 MHz Pentium
II running Linux. The bracketing rate was 1000 sentences per minute, and the trie took up
100 MB of memory.

In a previous implementation (Argamon, Dagan, and Krymolowski, 1998) we encoded
the entire corpus in two suffix trees. In that design, inspired by Satta and Henderson (1997),
one suffix tree encoded the positive examples and the other encoded the entire corpus and was
used for obtaining the total counts of tiles. Following Roth (1998b), we have decided to use a trie
of positive tiles only. While using suffix trees gives a better worst-case time complexity for
building the memory, using a trie improves performance by 25-30%. This is mainly because
the trie is used for storing positive examples only.



4 Evaluation 
4.1 The Data 

We have tested our algorithm in recognizing three syntactic patterns: noun phrase sequences 
(NP), verb-object (VO), and subject- verb (SV) relations. The training and testing data were 
derived from the Penn TreeBank. 

We used the NP data prepared by RM95; these data were tagged by Brill's POS tagger.
The SV and VO data were extracted using T (TreeBank's search script language) scripts
(available at http://www.cs.biu.ac.il/~yuvalk/MBSL). Table 1 summarizes the sizes of
the training and test data sets and the number of examples in each. The T scripts did not
attempt to match dependencies over very complex structures, since we are concerned with
shallow, or local, patterns.

insert table 1 here 

For VO patterns, we put the starting delimiter just before the main verb and the ending 
delimiter just after the object head. For example: 

. . . investigators started to 
[ view the lower price levels ] 
as attractive . . . 

We used a similar policy for SV patterns, defining the start of the pattern at the start of the 
subject noun phrase and the end right at the first verb encountered. For example: 

. . . argue that 

[ the U.S. should regulate ] 
the class . . . 

The subject and object noun-phrase boundaries were those specified by the Penn TreeBank 
annotators, and phrases containing conjunctions or appositives were not further analysed. 






CandidateScore(situated-candidate, memory, $\theta_t$)

1. Let $G \leftarrow$ CoverGraph(situated-candidate, $\theta_t$);
2. Compute num, minsize, maxcontext, and maxoverlap by performing DFS on $G$;
3. Return the candidate score $f_c$ as a function of these statistics.

Figure 4: The candidate scoring algorithm.



SelectCandidates(candidate-set)

1. Examine each candidate $c \in$ candidate-set such that $f_c(c) > 0$, in descending order of
   $f_c(c)$:
2.     Add $c$'s brackets to the sentence;
3.     Remove all candidates overlapping with $c$ from candidate-set;
4. Return the candidates remaining in candidate-set as target instances.

Figure 5: Candidate selection algorithm.



Train Data:

        sentences   words     patterns
NP      8936        229598    54760
VO      16397       454375    14271
SV      16397       454375    25024

Test Data:

        sentences   words     patterns
NP      2012        51401     12335
VO      1921        53604     1626
SV      1921        53604     3044

Table 1: Sizes of training and test data
By terminating the SV pattern right after the first verb encountered we get seemingly
incorrect relation descriptions such as:

[ the bird which stands ] there sings 

While the actual SV pair is bird - sings, the T script extracts bird - stands. This pair 
of words is not a properly grammatical SV, but it contains important information and may 
be regarded as grammatically meaningful in its own right. Hence we do not believe that it 
impinges on the evaluation of our learning method. 

Table 2 shows the distribution of pattern lengths in the training data. We did not attempt to
extract passive-voice VO relations. We extracted SVs from 92.4% of the sentences and
VOs from 58.5% (these figures refer to the training and test data taken together). Almost all
(99.8%) of the sentences in the NP data contain a noun phrase. Table 3 shows the number of
examples for corpus sizes of 50K, 100K, 150K, and 200K words.

insert table 2 here 

insert table 3 here 



4.2 Testing Methodology 

The test procedure has two parameters: (a) the maximum context size of a candidate, which
limits what queries are performed on the memory, and (b) the threshold $\theta_t$ used for
establishing a matching tile, which determines how to make use of the query results. (The
candidate score threshold $\theta_c$ was set to 0, so that every candidate with at least one cover
was considered.)

Recall and precision figures were obtained for various parameter values. $F_\beta$, a common
measure in information retrieval (van Rijsbergen, 1979), was used as a single-figure measure
of performance:

$$F_\beta = \frac{(\beta^2 + 1) \cdot P \cdot R}{\beta^2 \cdot P + R}$$

We use $\beta = 1$, which gives no preference to either recall or precision.
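As a quick numerical check of the measure, the sketch below evaluates it for the VO and SV recall and precision figures reported in table 4. The function name `f_beta` is hypothetical; the formula is the standard van Rijsbergen definition.

```python
def f_beta(precision, recall, beta=1.0):
    """F-measure combining precision and recall; beta=1 weights them equally."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1) * precision * recall / (b2 * precision + recall)


# Recall/precision for VO (89.8 / 77.1) and SV (84.5 / 88.6) from table 4.
print(round(f_beta(77.1, 89.8), 1))  # 83.0
print(round(f_beta(88.6, 84.5), 1))  # 86.5
```

With $\beta = 1$ the measure reduces to the harmonic mean of precision and recall, so it is symmetric in its two arguments.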

Performance may also be measured on a word-by-word basis, counting as a success any
word which was identified correctly as being part of the target pattern. That method was
employed, along with recall/precision, by RM95. We preferred to measure performance by
recall and precision for complete patterns. Most errors involved identifications of slightly
shifted, shorter or longer sequences. Given a pattern consisting of five words, for example,
identifying only a four-word portion of this pattern would yield both a recall error and a
precision error. Tag-assignment scoring, on the other hand, would give it a score of 80%. We hold
the view that such an identification is an error, rather than a partial success.
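The five-word example can be worked through in a short sketch. It assumes patterns are represented as hypothetical `(start, end)` word spans with an exclusive end, and it models tag-assignment scoring in a simplified way, as the fraction of the true pattern's words that fall inside the predicted span.

```python
def exact_match_scores(gold, predicted):
    """Recall and precision when only complete patterns count as correct."""
    gold, predicted = set(gold), set(predicted)
    hits = len(gold & predicted)
    recall = hits / len(gold) if gold else 0.0
    precision = hits / len(predicted) if predicted else 0.0
    return recall, precision


def word_level_credit(gold_span, predicted_span):
    """Fraction of the true pattern's words covered by the predicted span."""
    gold_words = set(range(*gold_span))
    predicted_words = set(range(*predicted_span))
    return len(gold_words & predicted_words) / len(gold_words)


gold = [(0, 5)]       # a five-word pattern
predicted = [(0, 4)]  # only its first four words were bracketed

print(exact_match_scores(gold, predicted))        # (0.0, 0.0)
print(word_level_credit(gold[0], predicted[0]))   # 0.8
```

Under complete-pattern scoring the near miss counts against both recall and precision, whereas the word-level view credits it with 80%.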



4.3 Results 

Table 4 summarizes the optimal parameter settings and results for NP, VO, and SV on the
test set. In order to find the optimal values of the context size and threshold, we tried
Len        NP      %     VO      %     SV      %
1          16959   31
2          21577   39    3203    22    7613    30
3          10264   19    5922    41    7265    29
4          3630    7     2952    21    3284    13
5          1460    3     1242    9     1697    7
6          521     1     506     4     1112    4
7          199           242     2     806     3
8          69            119     1     592     2
9          40            44            446     2
10         18            20            392     2
>10        23            23            1917    8
total      54760         14271         25024
avg. len   2.2           3.4           4.5

Table 2: Distribution of pattern lengths, total number of patterns and average length in the
training data.



Words   NP      VO     SV
50K     12100   1603   2764
100K    23864   3180   5385
150K    35799   4798   8237
200K    47730   6313   11034

Table 3: Number of pattern instances within 50K to 200K words
$0.1 < \theta_t < 0.95$, and maximum context sizes of 1, 2, and 3. Our experiments used 5-fold
cross-validation on the training data to determine the optimal parameter settings.

insert table 4 here 

In experimenting with the maximum context size parameter, we found that the difference
between the values of $F_\beta$ for context sizes of 2 and 3 is less than 0.5% for the optimal
threshold. A context size of 1 yielded $F_\beta$ values lower by more than 1% than the
values for the larger contexts.

Figure 6 shows recall/precision curves for the three data sets, obtained by varying $\theta_t$
while keeping the maximum context size at its optimal value. The difference between $F_{\beta=1}$
values for different thresholds was always less than 2%.

insert figure 6 here 

Figure 7 shows the learning curves as a function of the number of training examples and the
number of words in the training data, for particular parameter settings.

insert figure 7 here 

The results of RM95 are shown in table 4 for comparison, along with those of Veenstra
(1998). All the cited results pertain to a training set of 229,000 words. Using a larger training
set of 950,000 words, RM95 attained a recall/precision of 93.5%/93.1% ($F_\beta$ = 93.3%); this
last result was achieved using lexical information.

Two results from RM95 are presented: for learning with and without lexical information. 
Only the latter should be compared with the results of our algorithm, because it also does 
not consider lexical information. All results for NPs are quite similar, up to recall/precision 
tradeoff. That may be related to limitations inherent in the particular choice of train/test 
data. 

Some of the errors resulted from noun-phrases beginning with a preposition, or an adverb, 
such as: 

[ IN About CD 20,000 NNS years ] RB ago 



[ RB yet DT another NN setback ] 

When the focus is on nouns, preceding prepositions and adverbs are not of interest. We
ran an experiment in which these words were taken out of NPs, and NPs composed of a single
word tagged IN or RB (e.g. IN that) were ignored. This experiment yielded better results,
as presented in table 4 in the line marked NP†.

The Penn TreeBank data contain trace markers, tagged as -NONE-. These would not
appear in raw text from another source. In order to see the effect of the trace markers,
we removed them and experimented with the optimal parameters. The results are
summarized in table 5. The poorer results indicate that the trace markers actually
improved the inference quality, especially for the VO pattern.

Both RM95 and Veenstra approach the chunking task as a classification task, using
chunk-tags. Both works use chunk-tag assignments of nearby words (up to a distance of 2)
as features, and employ a post-processing step for ensuring that the chunk-tags yield a proper
chunking (for example, correcting cases where a chunk is opened within another chunk). Since
chunk tags of preceding words were obtained using information about even earlier words,
                Con.   Thresh.   Break Even   Recall (%)   Precision (%)   $F_{\beta=1}$ (%)
VO              2      0.5       81.3         89.8         77.1            83.0
SV              3      0.6       86.1         84.5         88.6            86.5
NP              3      0.6       91.4         91.6         91.6            91.6
NP†             3      0.6                    92.4         92.4            92.4
RM95 (NP)                                     90.7         90.5            90.6
  + Lex                                       92.3         91.8            92.0
Veenstra (NP)                                 94.3         89.0            91.6

Table 4: Results with optimal parameter settings for context size and threshold, and
breakeven points. The NP† line presents results obtained when removing adverbs and
prepositions from the beginning of NPs (section 4.3). The last lines show the results of RM95
(with and without lexical features) and Veenstra (1998, using a different POS tagger) for
recognizing NPs with the same train/test data. The optimal parameters were obtained
by 5-fold cross-validation. Note that the results for SV and VO were obtained from Penn
TreeBank data which include trace markers. In table 5 we present results without the trace
markers. These results are degraded, especially for VO.




Figure 6: Recall-Precision curves for NP, VO, and SV; $0.1 < \theta_t < 0.99$
using them as features implies that information about preceding words at larger distances is
taken into account. In these approaches there is an asymmetry between words to the left, 
and words to the right of the current word, as tagging is carried out from left to right. 

Our method differs from these methods in that it considers sequences rather than
individual words. Tiles of any length may contribute to the decision, with no preference for the
beginning or the end of a pattern (in contrast with Cardie and Pierce (1998), who used only
the prefixes of NPs). That makes the presented algorithm distinct from finite-state methods,
which are directional in nature.

Indeed, at an earlier phase of the research we tried to learn a stochastic transducer, using
the method of Ron, Singer, and Tishby (1995). We trained the transducer for detecting
NPs in POS sequences, with the sentence presented at different offsets so the transducer
could identify different NPs. That method, however, yielded poor results, possibly due to
the difficulty of combining generalizations about both the beginning of an NP and its end.



4.4 Common Errors 

This section contains a brief account of typical error-sources for the NP, SV, and VO learning 
tasks. In the examples shown [ ] bracket the true pattern instance, whereas [ [ ] ] bracket 
the one found by the algorithm. 

The NP data were tagged automatically, which certainly introduced some errors. Since the
tagging errors are consistent throughout the train and test data, their effect is to increase
data sparseness. Most of the errors in the NP learning task were due to:

• Coordinations and punctuation marks: Depending on the context, some NPs which 
contain a coordination were split and some NPs separated by a coordination were 
detected as a single NP. 



, , [ [[ DT a JJ local NN lawyer ]] CC and 
[[ JJ human-rights NN monitor ]] ], , 



NN relationship IN between [[ [ NNP Mr. NNP Noriega ]
CC and [ NNP Washington ] ]] IN that

In order to estimate the effect of coordinations we ran a test in which all NPs
containing CCs were split into their more basic components. That increased the number of
phrases by 2%, and the recall/precision to 93.4% and 92.9% respectively. However, we
are then faced with the new problem of deciding whether or not to create a single NP
from two NPs connected by a coordination.

• Adjacent phrases: Cases where two adjacent NPs were detected as a single phrase 

VBZ recalls [[ [ PRP$ his NN close NN ally ] 
[ NNP Jose NNP Blandon ] ]] , , 
• Ambiguity of verbs: Words tagged VBG or VBN may be verbs as well as modifiers. The
way these words are handled depends on the context

, , [VBN fixed [[ VBG leading NNS edges]] ] IN for 

Note that leading was part of the detected NP, probably due to the VBN context, vs. 

PRP it VBD had [VBG operating [[ NN profit ]] ] IN of $ $ CD million 

where the context is different. A proper treatment of this issue has to take into account 
lexical information, not only POS data. 

• Quotation marks: Many of the errors involved phrases which contain quotation marks 

IN by [ " " [[ DT a RB very JJ meaningful " "
NN increase ]] ] IN in

A test with all quotation marks completely removed did not yield any significant
improvement in the results.

In the SV and VO learning tasks, the text was taken directly from Penn TreeBank, hence 
the number of tagging errors was smaller. We have made tests with quotation marks removed 
for these tasks as well, using the optimal parameters. The result was a degradation of the 
precision by 1% for SV and 0.5% for VO, along with a slight increase of the recall (0.2% for 
SV and 0.4% for VO). These effects are insignificant. 

The VO patterns did not take into account cases where the verb is a be-verb, such as the
pair is-chairman in

NNP Mr. NNP Vinken VBZ is NN chairman IN of

Using POS tags only, it is impossible to distinguish these verbs. We therefore conducted
an experiment in which be, am, is, are, were, and was (and their abbreviations) were
tagged with the new POS tag VBE. That was a rudimentary way of introducing lexical
knowledge. The results of this experiment are presented in table 5. The incorporated
knowledge improved recall by 1.5% and, more significantly, precision by 4.3%.
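The re-tagging step can be sketched as a simple preprocessing pass over (word, tag) pairs. This is only an illustration: the function name `retag_be_verbs` is hypothetical, the set of abbreviated forms is an assumption (the text does not list them), and the check that the original tag is a verb tag is added here so that a possessive 's is left alone.

```python
# Full forms named in the text, plus assumed abbreviated forms.
BE_FORMS = {"be", "am", "is", "are", "were", "was", "'s", "'re", "'m"}


def retag_be_verbs(tagged):
    """Replace the tag of be-verbs by VBE in a list of (word, tag) pairs."""
    out = []
    for word, tag in tagged:
        # Only verb tags are rewritten, so possessive 's (tag POS) is kept.
        if tag.startswith("VB") and word.lower() in BE_FORMS:
            tag = "VBE"
        out.append((word, tag))
    return out


sentence = [("Mr.", "NNP"), ("Vinken", "NNP"), ("is", "VBZ"),
            ("chairman", "NN"), ("of", "IN")]
print(retag_be_verbs(sentence)[2])  # ('is', 'VBE')
```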

A significant part of the remaining VO recognition errors were results of verbs appearing 
as modifiers rather than actions: 

NN asbestos [[ VBG including NN crocidolite ]] RBR more RB stringently 
DT all [[ VBG remaining NNS uses ]] IN of 

TO to [ VB indicate [[ VBG declining NN interest NNS rates ]] ] 
Regarding SV learning, most of the recall errors involve very long pattern instances. As
table 2 shows, about 8% of the SV patterns are more than 10 words long. At such lengths
the diversity is large and it is harder to model all the cases. Similarly, many of the precision
mistakes are a result of erroneously detecting relatively short instances, or sub-parts of a
pattern. The variety of possible tiles increases the chance of false detections.

insert table 5 here 

5 Conclusions 

We have presented a novel general algorithm for learning sequential patterns. Applying
the method to three syntactic patterns in English yielded positive results, suggesting its
applicability for recognizing syntactic chunks. The results for noun-phrase learning are
comparable with the current state of the art. The algorithm also achieved good results when
applied to patterns which constitute a relation between words (subject-verb, verb-object),
rather than a chunk.

Many of the errors can be attributed to the lack of lexical and semantic information, as 
the input consists only of POS tags. These errors point out sub-problems (e.g. ambiguity of 
gerunds) which cannot be handled using POS information alone. 

The presented algorithm is part of a more comprehensive learnable shallow parser that is
currently being planned. The planned shallow parser will handle sequential patterns (chunks)
as well as attachment problems (e.g. prepositional phrase attachment). Attachment problems, like
many of the cases where POS data are not enough, require lexical information. We plan to use SNOW
(Roth, 1998a), a feature-based network learning architecture based on Winnow, for learning
to handle such problems. SNOW can also be used within our algorithm, for principled
candidate scoring through weighting the statistical features of candidates.
        Con.   Thresh.   Recall (%)
VO      2
VO^^    2      0.5
SV      3

Table 5: SV and VO results without trace markers; notice the degradation compared with
the original data. The line labelled VO^^ presents results obtained with all be-verbs tagged
by a special POS tag.